[Day 6] Data Model(3) - Graph-Like Data Model - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天

第 12 屆 iThome 鐵人賽

DAY 6

AI & Data

資料工程師修煉之路系列第 6 篇

[Day 6] Data Model(3) - Graph-Like Data Model

12th鐵人賽 data engineering data engineer

tshine73

2020-09-21 20:19:45

4122 瀏覽

分享至

接續 Day 5

Graph Data Model 初探

最後要來談談 Graph Data Model，若你軟體的資料關係是大部份一對多，使用 Document Model 就很足夠了，但若多對多關係是佔你軟體資料中的很大部份呢？

Relational Model 可以處理較簡單的多對多關係，但如果較複雜的話就不太適合，例如朋友之間的關係，此時看起來直覺多的 Graph Model 就是個好選擇了，一個 Graph 通常分 2 個型態的物件：

Vertices (也可稱 Nodes 或 Entities)
Edges (也可稱 Relationships 或 Arcs)

用不同場景舉例的話則能分別代表不同意思：

社群 Graph

Vertices 就是人，Edges 就是人之間的關係
網頁 Graph

Vertices 就是網頁，Edges 就代表每個網頁之間的連結
道路或鐵路 Graph

Vertices 就是交叉路口，Edges 就代表每個路口間的路

在學資料結構時應該都有學到圖論的演算法，這些算法能幫助操作圖，例始找到 2 個路口間的最短距離，或者看哪個網頁被連結的最多，讓搜尋引擎能做排序等等。

除此之外，同一個 Graph 裡能用多種類型的資料，例如 Facebook Graph Model 的 Vertices 就有人、地點、活動、評論等等，如下圖，每一個長方形都是 Vertices，但同時有人和地點 2 種類型，彼此用 Edge 代表某種關係，如 Lucy 嫁給 Alain，2 人一起住在倫敦。

figure_2-5

接下來會介紹 2 種不同的 Graph Data Model 架構，分別是 property graph (Neo4j, Titan 實作) 跟 triple-store (Datomic, AllegroGraph 實作)，然後跟如何做 Query。

Property Graphs

在 property graph model 裡，每一個 vertex 都包含：

唯一識別 ID
一個連出去的 edges set
一個連進來的 edges set
一個 key-values 的集合 (collection)

每一個 edge 都包含：

唯一識別 ID
起點 vertex
終點 vertex
一個標記這是什麼關係的 label
一個 key-values 的集合 (collection)

這可以用以下 2 個 relational 表格來示範 (PostgreSQL schema)，連出去跟連進來的 edges sets 可直接用查詢 edges 表格來產生：

CREATE TABLE vertices (
    vertex_id integer PRIMARYKEY, 
    properties json
);

CREATE TABLE edges (
    edge_id integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id), 
    head_vertex integer REFERENCES vertices (vertex_id), 
    label text,
    properties json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);

Graph model 可以很簡單的新增不同類型的 Vertex，以圖 2-5 舉例，我們可以新增食物的 Vertex 和喜不喜歡吃的 Edge。

Cypher Query Language

Cypher 是一個 property graph 用的宣告式的查詢語言，為了 Neo4j 建立的，我們可以用下面這個 Cypher 語法來新增圖 2-5 的 Graph 資料到 Graph database 裡，其他的 Vertex 跟 Edge 可用類似的語法建立資料，

CREATE
  (NAmerica:Location {name:'North America', type:'continent'}),
  (USA:Location      {name:'United States', type:'country'  }),
  (Idaho:Location    {name:'Idaho',         type:'state'    }),
  (Lucy:Person       {name:'Lucy' }),
(Idaho) -[:WITHIN]-> (USA) -[:WITHIN]-> (NAmerica), (Lucy) -[:BORN_IN]-> (Idaho)

當資料都建完時，就可用 Cypher 來做一些有趣的查詢，例如 有哪些人是從 US 搬到 Europe？ 我們可以這樣查詢：

MATCH
	(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
	(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'}) 
RETURN person.name

先找到在 US 出生的人：

(person) -[:BORN_IN]-> () 從 person 的 vetex 找到任一 edge label 有 BORIN_IN，右邊的 () 代表不指定終點 vertex
在用 [:WITHIN*0..] edge 建成一條鍊找到符合 name 為 US 的 Vertex

然後用一樣邏輯找到現在住在 Europe 的人，最後返回同時 Match 的人名。

Cypher Query In SQL

你也可以把資料存到前面舉例過的 Relational database 裡，然後又 SQL來查詢 **有哪些人是從 US 搬到 Europe？**但查起可能就沒這麼方便了，

在 Realtional database 中，我們也是可以用 Join 的方式去 traverse (遍歷) 所有 Vertex，但寫起來會像這樣：

WITH RECURSIVE
-- in_usa is the set of vertex IDs of all locations within the United States
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'within' 
  ),
-- in_europe is the set of vertex IDs of all locations within Europe
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'within' 
  ),
-- born_in_usa is the set of vertex IDs of all people born in the US
  born_in_usa(vertex_id) AS (
      SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in' 
  ),
-- lives_in_europe is the set of vertex IDs of all people living in Europe
  lives_in_europe(vertex_id) AS (
      SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in' 
  )

SELECT vertices.properties->>'name'
FROM vertices
-- join to find those people who were both born in the US *and* live in Europe 
JOIN born_in_usa ON vertices.vertex_id = born_in_usa.vertex_id
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;

我也懶的解釋這個 SQL 邏輯了XD (頭疼中)，這還只是上面那個簡單問題的查詢，更不消說比較複雜的查詢了。

如 Day 4 最一開始提到的那樣，為你的軟體選擇適合的 data model 是再重要不過了！

Day 7 待續吧！